Abstract


This paper proposes a tree-based convolutional neural network (TBCNN)
for discriminative sentence modeling. Our models leverage either
constituency trees or dependency trees of sentences. The tree-based
convolution process extracts sentences’ structural features, and these
features are aggregated by max pooling. Such architecture allows short
propagation paths between the output layer and underlying feature
detectors, which enables effective structural feature learning and
extraction. We evaluate our models on two tasks: sentiment analysis and
question classification. In both experiments, TBCNN outperforms
previous state-of-the-art results, including existing neural networks
and dedicated feature/rule engineering. We also make efforts to
visualize the tree-based convolution process, shedding light on how our
models work.

1 Introduction 


Discriminative sentence modeling aims to capture sentence meanings, and
classify sentences according to certain criteria (e.g., sentiment). It
is related to various tasks of interest, and has attracted much
attention in the NLP community (Allan et al., 2003; Su and Markert,
2008; Zhao et al., 2015). Feature engineering, for example n-gram
features (Cui et al., 2006), dependency subtree features (Nakagawa et
al., 2010), or more dedicated ones (Silva et al., 2011), can play an
important role in modeling sentences. Kernel machines, e.g., SVM, are
exploited in Moschitti (2006) and Reichartz et al. (2010) by specifying
a certain measure of similarity between sentences, without explicit
feature representation.


* These authors contribute equally to this paper. 
^ Corresponding author. 


Recent advances of neural networks bring new techniques in
understanding natural languages, and have exhibited considerable
potential. Bengio et al. (2003) and Mikolov et al. (2013) propose
unsupervised approaches to learn word embeddings, mapping discrete
words to real-valued vectors in a meaning space. Le and Mikolov (2014)
extend such approaches to learn sentences’ and paragraphs’
representations. Compared with human engineering, neural networks serve
as a way of automatic feature learning (Bengio et al., 2013).

Two widely used neural sentence models are convolutional neural
networks (CNNs) and recursive neural networks (RNNs). CNNs can extract
words’ neighboring features effectively with short propagation paths,
but they do not capture inherent sentence structures (e.g., parsing
trees). RNNs encode, to some extent, structural information by
recursive semantic composition along a parsing tree. However, they may
have difficulties in learning deep dependencies because of long
propagation paths (Erhan et al., 2009). (CNNs/RNNs and a variant,
recurrent networks, will be reviewed in Section 2.)


A curious question is whether we can combine the advantages of CNNs and
RNNs, i.e., whether we can exploit sentence structures (like RNNs)
effectively with short propagation paths (like CNNs).


In this paper, we propose a novel neural architecture for
discriminative sentence modeling, called the Tree-Based Convolutional
Neural Network (TBCNN). Our models can leverage different sentence
parsing trees, e.g., constituency trees and dependency trees. The model
variants are denoted as c-TBCNN and d-TBCNN, respectively. The idea of
tree-based convolution is to apply a set of subtree feature detectors,
sliding over the entire parsing tree of a sentence; then pooling
aggregates these extracted feature vectors by taking the maximum value
in each dimension. One merit of such architecture is that all features,
along the tree, have short propagation paths to the output layer, and
hence structural information can be learned effectively.

Figure 1: A comparison of information flow in the convolutional neural
network (CNN), the recursive neural network (RNN), and the tree-based
convolutional neural network (TBCNN).

TBCNNs are evaluated on two tasks, sentiment analysis and question
classification; our models have outperformed previous state-of-the-art
results in both experiments. To understand how TBCNNs work, we also
visualize the network by plotting the convolution process. We make our
code and results available on our project website.¹


2 Background and Related Work

In this section, we present the background and related work regarding
two prevailing neural architectures for discriminative sentence
modeling.


2.1 Convolutional Neural Networks


Convolutional neural networks (CNNs), early used for image processing
(LeCun et al., 19^), turn out to be effective with natural languages as
well. Figure 1 depicts a classic convolution process on a sentence
(Collobert and Weston, 2008). A set of fixed-width-window feature
detectors slide over the sentence, and output the extracted features.
Let $t$ be the window size, and $\mathbf{x}_1, \cdots, \mathbf{x}_t \in
\mathbb{R}^{n_e}$ be $n_e$-dimensional word embeddings. The output of
convolution, evaluated at the current position, is

    $\mathbf{y} = f(W \cdot [\mathbf{x}_1; \cdots; \mathbf{x}_t] + \mathbf{b})$

where $\mathbf{y} \in \mathbb{R}^{n_c}$ ($n_c$ is the number of feature
detectors). $W \in \mathbb{R}^{n_c \times (t \cdot n_e)}$ and
$\mathbf{b} \in \mathbb{R}^{n_c}$ are parameters; $f$ is the activation
function. Semicolons represent column vector concatenation. After
convolution, the extracted features are pooled to a fixed-size vector
for classification.

Convolution can extract neighboring information effectively. However,
the features are “local”: words that are not in the same convolution
window do not interact with each other, even though they may be
semantically related. Kalchbrenner et al. (2014) build deep
convolutional networks so that local features can mix at high-level
layers. Similar deep CNNs include Kim (2014) and Hu et al. (2014). All
these models are “flat,” by which we mean no structural information is
used explicitly.
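The flat convolution of Section 2.1 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function names, the use of ReLU as the activation $f$, the absence of padding, and max pooling as the pooling step are all assumptions made for the example.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(z):
    """Element-wise ReLU, used here as the activation f (an assumption)."""
    return [max(0.0, a) for a in z]


def flat_convolution(embeddings, W, b, t, f=relu):
    """Apply y = f(W . [x_1; ...; x_t] + b) at every window position.

    embeddings: list of n_e-dimensional word vectors, one per word.
    W: n_c x (t * n_e) matrix; b: n_c-dimensional bias.
    No padding is used, so sentences shorter than t yield no features.
    """
    features = []
    for start in range(len(embeddings) - t + 1):
        # Concatenate the t word vectors in the window into one long vector.
        window = [a for x in embeddings[start:start + t] for a in x]
        z = [wz + bz for wz, bz in zip(matvec(W, window), b)]
        features.append(f(z))
    return features


def max_pool(features):
    """Pool a non-empty list of feature vectors to one fixed-size vector."""
    return [max(column) for column in zip(*features)]


# Toy usage: three 2-dimensional "word embeddings", one feature detector.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 1.0, 1.0, 1.0]]
b = [0.0]
pooled = max_pool(flat_convolution(sentence, W, b, t=2))
```

Each window only sees $t$ consecutive words, which is the "local" limitation discussed above.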


¹ https://sites.google.com/site/tbcnnsentence/ (This site is properly
anonymized, and complies with the double-blind review requirement.)


2.2 Recursive Neural Networks 


Recursive neural networks (RNNs), proposed in Socher et al. (2011b),
utilize sentence parsing trees. In the original version, RNN is built
upon a binarized constituency tree. Leaf nodes correspond to words in a
sentence, represented by $n_e$-dimensional embeddings. Non-leaf nodes
are sentence constituents, coded by child nodes recursively. Let node
$p$ be the parent of $c_1$ and $c_2$, with vector representations
denoted as $\mathbf{p}$, $\mathbf{c}_1$, and $\mathbf{c}_2$. The
parent’s representation is composed by

    $\mathbf{p} = f(W \cdot [\mathbf{c}_1; \mathbf{c}_2] + \mathbf{b})$    (1)

where $W$ and $\mathbf{b}$ are parameters. This process is done
recursively along the tree; the root vector is then used for supervised
classification (Figure 1).
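The recursive composition of Equation 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the nested-tuple tree encoding, the `embed` lookup table, and the choice of tanh as the activation $f$ are hypothetical and are not taken from the paper.

```python
import math


def compose(W, b, c1, c2):
    """Equation 1: p = f(W . [c1; c2] + b), with f = tanh (an assumption)."""
    x = c1 + c2  # list concatenation plays the role of [c1; c2]
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def encode(tree, embed, W, b):
    """Return the vector of a binarized tree.

    tree is either a word (str) or a pair (left_subtree, right_subtree).
    embed maps words to n_e-dimensional vectors; W is n_e x (2 * n_e).
    """
    if isinstance(tree, str):
        return embed[tree]
    left, right = tree
    return compose(W, b, encode(left, embed, W, b), encode(right, embed, W, b))


# Toy usage on the sentence "I loved it" with 2-dimensional embeddings.
embed = {"I": [1.0, 0.0], "loved": [0.0, 1.0], "it": [1.0, 1.0]}
W = [[0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.5]]
b = [0.0, 0.0]
root = encode(("I", ("loved", "it")), embed, W, b)
```

Note that a leaf such as "it" reaches the root only after passing through every composition on its path, which is the long propagation path the surrounding text refers to.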

Dependency parsing and the combinatory categorial grammar can also be
exploited as RNNs’ skeletons (Hermann and Blunsom, 2013; Iyyer et al.,
2014). Irsoy and Cardie (2014) build deep RNNs to enhance information
interaction. Improvements for semantic compositionality include
matrix-vector interaction (Socher et al., 2012) and tensor interaction
(Socher et al., 2013). They are more suitable for capturing logical
information in sentences, such as negation and exclamation.

One potential problem of RNNs is that the long propagation paths,
through which leaf nodes are connected to the output layer, may lead to
information loss. Thus, RNNs bury illuminating information under a
complicated neural architecture. Further, during back-propagation over
a long path, gradients tend to vanish (or blow up), which makes
training difficult (Erhan et al., 2009). Long short term memory (LSTM),
first proposed for modeling time-series data (Hochreiter and
Schmidhuber, 1997), is integrated to RNNs to alleviate this problem
(Tai et al., 2015; Le and Zuidema, 2015; Zhu et al., 2015).


Recurrent networks. A variant class of RNNs is the recurrent neural
network (Bengio et al., 1994; Shang et al., 2015), whose architecture
is a rightmost tree. In such models, meaningful tree structures are
also lost, similar to CNNs.


3 Tree-based Convolution 

This section introduces the proposed tree-based convolutional neural
networks (TBCNNs). Figure 2 depicts the convolution process on a tree.

First, a sentence is converted to a parsing tree, either a constituency
or dependency tree. The corresponding model variants are denoted as
c-TBCNN and d-TBCNN. Each node in the tree is represented as a
distributed, real-valued vector.

Then, we design a set of fixed-depth subtree feature detectors, called
the tree-based convolution window. The window slides over the entire
tree to extract structural information of the sentence, illustrated by
a dashed triangle in Figure 2. Formally, let us assume we have $t$
nodes in the convolution window, $\mathbf{x}_1, \cdots, \mathbf{x}_t$,
each represented as an $n_e$-dimensional vector. Let $n_c$ be the
number of feature detectors. The output of the tree-based convolution
window, evaluated at the current subtree, is given by the following
generic equation.

    $\mathbf{y} = f\left(\sum_{i=1}^{t} W_i \cdot \mathbf{x}_i + \mathbf{b}\right)$    (2)

where $W_i \in \mathbb{R}^{n_c \times n_e}$ is the weight parameter
associated with node $\mathbf{x}_i$; $\mathbf{b} \in \mathbb{R}^{n_c}$
is the bias term.
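As a quick consistency check (our own restatement, not an additional claim of the paper), Equation 2 has the same algebraic form as the flat convolution of Section 2.1. Splitting the flat weight matrix column-wise into $t$ blocks $W_i \in \mathbb{R}^{n_c \times n_e}$ gives:

```latex
W \cdot [\mathbf{x}_1; \cdots; \mathbf{x}_t]
  = \begin{bmatrix} W_1 & \cdots & W_t \end{bmatrix}
    \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_t \end{bmatrix}
  = \sum_{i=1}^{t} W_i \cdot \mathbf{x}_i
```

Under this reading, the only difference lies in which vectors fill the window: $t$ consecutive words in the flat case, versus the $t$ nodes of a subtree in the tree-based case.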

Extracted features are thereafter packed into one or more fixed-size
vectors by max pooling, that is, the maximum value in each dimension is
taken. Finally, we add a fully connected hidden layer, and a softmax
output layer.

From the designed architecture (Figure 1), we see that our TBCNN models
allow short propagation paths between the output layer and any position
in the tree. Therefore structural feature learning becomes effective.


Several main technical points in tree-based convolution include: (1)
How can we represent hidden nodes as vectors in constituency trees? (2)
How can we determine weights, $W_i$, for dependency trees, where nodes
may have different numbers of children? (3) How can we pool varying
sized and shaped features to fixed-size vectors?

In the rest of this section, we explain model variants in detail.
Particularly, Subsections 3.1 and 3.2 address the first and second
problems; Subsection 3.3 deals with the third problem by introducing
several pooling heuristics. Subsection 3.4 presents our training
objective.

3.1 c-TBCNN 

Figure 2a illustrates an example of the constituency tree, where leaf
nodes are words in the sentence, and non-leaf nodes represent a
grammatical constituent, e.g., a noun phrase. Sentences are parsed by
the Stanford parser;² further, constituency trees are binarized for
simplicity.

One problem of constituency trees is that non-leaf nodes do not have
such vector representations as word embeddings. Our strategy is to
pretrain the constituency tree with an RNN by Equation 1 (Socher et
al., 2011b). After pretraining, vector representations of nodes are
fixed.

We now consider the tree-based convolution process in c-TBCNN with a
two-layer-subtree convolution window, which operates on a parent node
$p$ and its direct children $c_l$ and $c_r$, their vector
representations denoted as $\mathbf{p}$, $\mathbf{c}_l$, and
$\mathbf{c}_r$. The convolution equation, specific for c-TBCNN, is

    $\mathbf{y} = f\left(W_p^{(c)} \cdot \mathbf{p} + W_l^{(c)} \cdot \mathbf{c}_l + W_r^{(c)} \cdot \mathbf{c}_r + \mathbf{b}^{(c)}\right)$

where $W_p^{(c)}$, $W_l^{(c)}$, and $W_r^{(c)}$ are weights associated
with the parent and its child nodes. Superscript $(c)$ indicates that
the weights are for c-TBCNN. For leaf nodes, which do not have
children, we set $\mathbf{c}_l$ and $\mathbf{c}_r$ to be $\mathbf{0}$.

Tree-based convolution windows can be extended to arbitrary depths
straightforwardly. The complexity is exponential in the depth of the
window, but linear in the number of nodes. Hence, tree-based
convolution, compared with “flat” CNNs, does not add to computational
cost, provided the same amount of information is processed at a time.
In our experiments, we use convolution windows of depth 2.
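A depth-2 c-TBCNN window sliding over a binarized constituency tree can be sketched as below. This is a hedged illustration rather than the released implementation: the `Node` class, the function names, and ReLU as the activation are assumptions, and node vectors are taken as given (in the paper they come from a pretrained RNN).

```python
class Node:
    """A node of a binarized constituency tree with a fixed vector."""

    def __init__(self, vec, left=None, right=None):
        self.vec, self.left, self.right = vec, left, right


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def c_tbcnn_window(node, Wp, Wl, Wr, b):
    """y = f(Wp.p + Wl.c_l + Wr.c_r + b), with f = ReLU (an assumption)."""
    zero = [0.0] * len(node.vec)  # missing children are set to 0
    cl = node.left.vec if node.left is not None else zero
    cr = node.right.vec if node.right is not None else zero
    z = [p + l + r + bi for p, l, r, bi in
         zip(matvec(Wp, node.vec), matvec(Wl, cl), matvec(Wr, cr), b)]
    return [max(0.0, a) for a in z]


def convolve_tree(root, Wp, Wl, Wr, b):
    """Apply the shared-weight window at every node; one feature per node."""
    features, stack = [], [root]
    while stack:
        node = stack.pop()
        features.append(c_tbcnn_window(node, Wp, Wl, Wr, b))
        stack.extend(c for c in (node.left, node.right) if c is not None)
    return features


# Toy usage: a root with two leaf children and identity weights.
I2 = [[1.0, 0.0], [0.0, 1.0]]
tree = Node([1.0, 0.0], Node([0.0, 2.0]), Node([3.0, 0.0]))
features = convolve_tree(tree, I2, I2, I2, [0.0, 0.0])
```

The traversal visits each node once, which matches the statement that the cost is linear in the number of nodes for a fixed window depth.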

² http://nlp.stanford.edu/software/lex-parser.shtml

Figure 2: Tree-based convolution in (a) c-TBCNN, and (b) d-TBCNN. The
parsing trees correspond to the sentence “I loved it.” The dashed
triangles illustrate a shared-weight convolution window sliding over
the tree. For clarity, only two positions are drawn in c-TBCNN. Notice
that dotted arrows are not part of neural connections; they merely
indicate the topologies of tree structures. Specifically, an edge
$a \xrightarrow{r} b$ in the dependency tree refers to $a$ being
governed by $b$ with dependency type $r$.


3.2 d-TBCNN 


Dependency trees are another representation of sentence structures. The
nature of dependency representation leads to d-TBCNN’s major difference
from traditional convolution: there exist nodes with different numbers
of child nodes. This causes trouble if we associate weight parameters
according to positions in the window, which is standard for traditional
convolution, e.g., Collobert and Weston (2008) or c-TBCNN.

To overcome the problem, we extend the notion of convolution by
assigning weights according to dependency types (e.g., nsubj) rather
than positions. We believe this strategy makes much sense because
dependency types (de Marneffe et al., 2006) reflect the relationship
between a governing word and its child words. To be concrete, the
generic convolution formula (Equation 2) for d-TBCNN becomes

    $\mathbf{y} = f\left(W_p^{(d)} \cdot \mathbf{p} + \sum_{i=1}^{n} W_{r[c_i]}^{(d)} \cdot \mathbf{c}_i + \mathbf{b}^{(d)}\right)$

where $W_p^{(d)}$ is the weight parameter for the parent $p$ (governing
word); $W_{r[c_i]}^{(d)}$ is the weight for child $c_i$, who has
grammatical relationship $r[c_i]$ to its parent, $p$; the sum runs over
the $n$ children of $p$. Superscript $(d)$ indicates the parameters are
for d-TBCNN. Note that we keep the 15 most frequently occurring
dependency types; others appearing rarely in the corpus are mapped to
one shared weight matrix.
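The type-indexed weights of d-TBCNN can be sketched as follows. This is an illustrative sketch under assumptions: the function and argument names are hypothetical, ReLU stands in for $f$, and the dictionary `W_by_type` is assumed to hold matrices only for the frequently occurring dependency types, with `W_shared` covering all others.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def d_tbcnn_window(parent_vec, children, Wp, W_by_type, W_shared, b):
    """y = f(Wp.p + sum_i W_{r[c_i]}.c_i + b), with f = ReLU (an assumption).

    children: list of (dependency_type, vector) pairs; may be empty.
    W_by_type: dict mapping kept dependency types to weight matrices.
    W_shared: the single matrix used for every rare, unlisted type.
    """
    z = [a + bi for a, bi in zip(matvec(Wp, parent_vec), b)]
    for dep_type, vec in children:
        W = W_by_type.get(dep_type, W_shared)  # rare types share one matrix
        z = [a + c for a, c in zip(z, matvec(W, vec))]
    return [max(0.0, a) for a in z]


# Toy usage: a governing word with one frequent and one rare dependent.
I2 = [[1.0, 0.0], [0.0, 1.0]]
W_by_type = {"nsubj": [[2.0, 0.0], [0.0, 2.0]]}
W_shared = [[0.5, 0.0], [0.0, 0.5]]
y = d_tbcnn_window([1.0, 1.0],
                   [("nsubj", [1.0, 0.0]), ("rare_type", [0.0, 3.0])],
                   I2, W_by_type, W_shared, [0.0, 0.0])
```

Because weights are looked up by relation rather than by position, the same function handles any number of children, including none.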

Both c-TBCNN and d-TBCNN have their own advantages: d-TBCNN exploits
structural features more efficiently because of the compact
expressiveness of dependency trees; c-TBCNN may be more effective in
integrating global features due to the underlying pretrained RNN.


3.3 Pooling Heuristics 

As different sentences may have different lengths and tree structures,
the extracted features by tree-based convolution also have topologies
varying in size and shape. Dynamic pooling (Socher et al., 2011a) is a
common technique for dealing with this problem. We propose several
heuristics for pooling along a tree structure. Our generic design
criteria for pooling include: (1) Nodes that are pooled to one slot
should be “neighboring” from some viewpoint. (2) Each slot should have
similar numbers of nodes, in expectation, that are pooled to it. Thus,
(approximately) equal amount of information is aggregated along
different parts of the tree. Following the above intuition, we propose
pooling heuristics as follows.

Figure 3: Pooling heuristics. (a) Global pooling. (b) 3-slot pooling
for c-TBCNN. (c) k-slot pooling for d-TBCNN.


• Global pooling. All features are pooled to one vector, shown in
  Figure 3a. We take the maximum value in each dimension. This simple
  heuristic is applicable to any structure, including c-TBCNN and
  d-TBCNN.

• 3-slot pooling for c-TBCNN. To preserve more information over
  different parts of constituency trees, we propose 3-slot pooling
  (Figure 3b). If a tree has maximum depth $d$, we pool nodes of less
  than $\alpha \cdot d$ layers to a TOP slot ($\alpha$ is set to 0.6);
  lower nodes are pooled to slots LOWER_LEFT or LOWER_RIGHT according
  to their relative position with respect to the root node.

  For a constituency tree, it is not completely obvious how to pool
  features to more than 3 slots and comply with the aforementioned
  criteria at the same time. Therefore, we regard 3-slot pooling for
  c-TBCNN as a “hard mechanism” temporarily. Further improvement can be
  addressed in future work.

• k-slot pooling for d-TBCNN. Different from constituency trees, nodes
  in dependency trees are in one-to-one correspondence with words in a
  sentence. Thus, a total order on features (after convolution) can be
  defined according to their corresponding word orders. For k-slot
  pooling, we can adopt an “equal allocation” strategy, shown in
  Figure 3c. Let $i$ be the position of a word in a sentence
  ($i = 1, 2, \cdots, n$). Its extracted feature vector is pooled to
  the $j$-th slot, if

      $(j - 1) \frac{n}{k} < i \le j \frac{n}{k}$

Task                     Data samples                                                                            Label
Sentiment analysis       Offers that rare combination of entertainment and education.                            ++
                         An idealistic love story that brings out the latent 15-year-old romantic in everyone.   +
                         Its mysteries are transparently obvious, and it’s too slowly paced to be a thriller.     −
Question classification  What is the temperature at the center of the earth?                                     number
                         What state did the Battle of Bighorn take place in?                                     location

Table 1: Data samples in sentiment analysis and question
classification. In the first task, “++” refers to strongly positive;
“+” and “−” refer to positive and negative, respectively.


We assess the efficacy of pooling quantitatively in Section 4.3.1. As
we shall see by the experimental results, complicated pooling methods
do preserve more information along tree structures to some extent, but
the effect is not large. TBCNNs are not very sensitive to pooling
methods.
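The three pooling heuristics of this subsection can be sketched together. This is a rough sketch under explicit assumptions: function names are hypothetical; for 3-slot pooling, how depth is counted and how "left" versus "right" is decided (here, a flag saying whether the node lies in the root's left subtree) are our guesses; and returning `None` for an empty slot is a choice made only for this example.

```python
def max_pool(features):
    """Global pooling: element-wise maximum over a non-empty list of vectors."""
    return [max(column) for column in zip(*features)]


def three_slot_name(depth, max_depth, in_left_subtree, alpha=0.6):
    """Slot for a constituency-tree node at the given depth.

    Nodes above the alpha * max_depth threshold go to TOP; lower nodes are
    split by their side relative to the root (an assumed interpretation).
    """
    if depth < alpha * max_depth:
        return "TOP"
    return "LOWER_LEFT" if in_left_subtree else "LOWER_RIGHT"


def k_slot_index(i, n, k):
    """Slot j (1-based) with (j - 1) * n / k < i <= j * n / k, for i in 1..n."""
    return (i * k + n - 1) // n  # integer ceiling of i * k / n


def k_slot_pool(features, k):
    """Equal-allocation pooling of word-ordered features into k slots."""
    n = len(features)
    slots = [[] for _ in range(k)]
    for i, feature in enumerate(features, start=1):
        slots[k_slot_index(i, n, k) - 1].append(feature)
    # A slot stays empty when n < k; None marks that case in this sketch.
    return [max_pool(slot) if slot else None for slot in slots]


# Toy usage: 2-slot pooling over three word-ordered feature vectors.
pooled = k_slot_pool([[1, 5], [2, 0], [0, 9]], k=2)
```

Using integer arithmetic for the slot index avoids floating-point rounding at slot boundaries.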


3.4 Training Objective 


After pooling, information is packed into one or more fixed-size
vectors (slots). We add a hidden layer, and then a softmax layer to
predict the probability of each target label in a classification task.
The error function of a sample is the standard cross entropy loss,
i.e., $J = -\sum_{i=1}^{c} t_i \log y_i$, where $\mathbf{t}$ is the
ground truth (one-hot represented), $\mathbf{y}$ the output by softmax,
and $c$ the number of classes. To regularize our model, we apply both
$\ell_2$ penalty and dropout (Srivastava et al., 2014). Training
details are further presented in Sections 4.1 and 4.2.
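The per-sample objective above can be sketched directly. The function names are ours; the max-subtraction inside the softmax is a standard numerical-stability trick and is not something the paper specifies.

```python
import math


def softmax(z):
    """Softmax over a list of scores, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(t, y):
    """J = -sum_i t_i * log(y_i) for a one-hot target t and probabilities y."""
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y) if ti > 0)


# Toy usage: two classes with equal scores; the loss equals log(2).
y = softmax([0.0, 0.0])
loss = cross_entropy([1, 0], y)
```

With a one-hot target, the sum reduces to the negative log-probability assigned to the correct class.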


4 Experimental Results 


In this section, we evaluate our models with two tasks, sentiment
analysis and question classification. We also conduct quantitative and
qualitative model analysis in Subsection 4.3.


4.1 Sentiment Analysis 

4.1.1 The Task and Dataset 

Sentiment analysis is a widely studied task for discriminative sentence
modeling. The Stanford sentiment treebank³ consists of more than 10,000
movie reviews. Two settings are considered for sentiment prediction:
(1) fine-grained classification with 5 labels (strongly positive,
positive, neutral, negative, and strongly negative), and (2)
coarse-grained polarity classification with 2 labels (positive versus
negative). Some examples are shown in Table 1. We use the standard
split for training, validating, and testing, containing 8544/1101/2210
sentences for 5-class prediction. Binary classification does not
contain the neutral class.

In the dataset, phrases (sub-sentences) are also tagged with sentiment
labels. RNNs deal with them naturally during the recursive process. We
regard sub-sentences as individual samples during training, like
Kalchbrenner et al. (2014) and Le and Mikolov (2014). The training set
therefore has more than 150,000 entries in total. For validating and
testing, only whole sentences (root labels) are considered in our
experiments.
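The sample-expansion step described above can be sketched as a tree traversal. This is an assumption-laden illustration: the tuple encoding of labeled trees, the integer labels (0 to 4), and the function names are hypothetical and do not reflect the treebank's actual file format.

```python
def phrase_samples(tree):
    """Return (text, label) for every node of a labeled binary tree.

    A leaf is (label, word); an internal node is (label, left, right).
    Nodes are emitted in post-order, so the whole sentence comes last.
    """
    samples = []

    def visit(node):
        if len(node) == 2:
            label, word = node
            samples.append((word, label))
            return word
        label, left, right = node
        text = visit(left) + " " + visit(right)
        samples.append((text, label))
        return text

    visit(tree)
    return samples


def root_sample(tree):
    """Whole-sentence sample only, as used for validation and testing."""
    return phrase_samples(tree)[-1]


# Toy usage: "I loved it" with hypothetical labels (2 = neutral, 4 = very positive).
tree = (4, (2, "I"), (4, (4, "loved"), (2, "it")))
train_entries = phrase_samples(tree)   # every phrase is a training sample
test_entry = root_sample(tree)         # only the root label is evaluated
```

A three-word sentence already yields five training entries, which illustrates how 8544 training sentences can expand to more than 150,000 entries.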

Both c-TBCNN and d-TBCNN use the Stanford 
parser for data preprocessing. 


4.1.2 Training Details 

This subsection describes training details for d-TBCNN, where
hyperparameters are chosen by validation. c-TBCNN is mostly tuned
synchronously (e.g., optimization algorithm, activation function) with
some changes in hyperparameters. c-TBCNN’s settings can be found on our
(anonymized) website.

³ http://nlp.stanford.edu/sentiment/



















Group      Method                  5-class accuracy  2-class accuracy  Reported in
Baseline   SVM                     40.7              79.4              Socher et al. (2013)
           Naive Bayes             41.0              81.8              Socher et al. (2013)
CNNs       1-layer convolution     37.4              77.1              Kalchbrenner et al. (2014)
           Deep CNN                48.5              86.8              Kalchbrenner et al. (2014)
           Non-static              48.0              87.2              Kim (2014)
           Multichannel            47.4              88.1              Kim (2014)
RNNs       Basic                   43.2              82.4              Socher et al. (2013)
           Matrix-vector           44.4              82.9              Socher et al. (2013)
           Tensor                  45.7              85.4              Socher et al. (2013)
           Tree LSTM (variant 1)   48.0              -                 Zhu et al. (2015)
           Tree LSTM (variant 2)   50.6              86.9              Tai et al. (2015)
           Tree LSTM (variant 3)   49.9              88.0              Le and Zuidema (2015)
           Deep RNN                49.8              86.6†             Irsoy and Cardie (2014)
Recurrent  LSTM                    45.8              86.7              Tai et al. (2015)
           bi-LSTM                 49.1              86.8              Tai et al. (2015)
Vector     Word vector avg.        32.7              80.1              Socher et al. (2013)
           Paragraph vector        48.7              87.8              Le and Mikolov (2014)
TBCNNs     c-TBCNN                 50.4              86.8†             Our implementation
           d-TBCNN                 51.4              87.9†             Our implementation

Table 2: Accuracy of sentiment prediction (in percentage). For 2-class
prediction, “†” remarks indicate that the network is transferred
directly from that of 5-class.


In our d-TBCNN model, the number of units is 300 for convolution and
200 for the last hidden layer. Word embeddings are 300 dimensional,
pretrained ourselves using word2vec (Mikolov et al., 2013) on the
English Wikipedia corpus. 2-slot pooling is applied for d-TBCNN.
(c-TBCNN uses 3-slot pooling.)

To train our model, we compute gradient by back-propagation and apply
stochastic gradient descent with mini-batch 200. We use ReLU (Nair and
Hinton, 2010) as the activation function.

For regularization, we add $\ell_2$ penalty for weights with a
coefficient of 10“®. Dropout (Srivastava et al., 2014) is further
applied to both weights and embeddings. All hidden layers are dropped
out by 50%, and embeddings 40%.

4.1.3 Performance 

Table 2 compares our models to state-of-the-art results in the task of
sentiment analysis. For 5-class prediction, d-TBCNN yields 51.4%
accuracy, outperforming the previous state-of-the-art result, achieved
by the RNN based on long short-term memory (Tai et al., 2015). c-TBCNN
is slightly worse. It achieves 50.4% accuracy, ranking third in the
state-of-the-art list (including our d-TBCNN model).

Regarding 2-class prediction, we adopted a simple strategy in Irsoy and
Cardie (2014),⁴ where the 5-class network is “transferred” directly for
binary classification, with estimated target probabilities (by 5-way
softmax) reinterpreted for 2 classes. (The neutral class is discarded
as in other studies.) This strategy enables us to take a glance at the
stability of our TBCNN models, but puts them at a disadvantage.
Nonetheless, our d-TBCNN model achieves 87.9% accuracy, ranking third
in the list.

⁴ Richard Socher, who first applies neural networks to this task,
thinks direct transfer is fine for binary classification. We followed
this strategy for simplicity as it is non-trivial to deal with the
neutral sub-sentences in the training set if we train a separate model.
Our website reviews some related work and provides more discussions.

In a more controlled comparison, with shallow architectures and the
basic interaction (linearly transformed and non-linearly squashed),
TBCNNs of both variants consistently outperform RNNs (Socher et al.,
2011b) to a large extent (50.4-51.4% versus 43.2%); they also
consistently outperform “flat” CNNs by more than 10 percentage points
(versus 37.4% for 1-layer convolution in Table 2). Such results show
that structures are important when modeling sentences; tree-based
convolution can capture such structural information more effectively
than RNNs.

We also observe d-TBCNN achieves higher performance than c-TBCNN. This
suggests that compact tree expressiveness is more important than
integrating global information in this task.

4.2 Question Classification

We further evaluate TBCNN models on a question classification task.⁵
The dataset contains 5452 annotated sentences plus 500 test samples in
TREC 10. We also use the standard split, like Silva et al. (2011).
Target labels contain 6 classes, namely abbreviation, entity,
description, human, location, and numeric. Some examples are also shown
in Table 1.

⁵ http://cogcomp.cs.illinois.edu/Data/QA/QC/

We chose this task to evaluate our models because the number of
training samples is rather small, so that we can know TBCNNs’
performance when applied to datasets of different sizes. To alleviate
the problem of data sparseness, we set the dimensions of convolutional
layer and the last hidden layer to 30 and 25, respectively. We do not
back-propagate gradient to embeddings in this task. Dropout rate for
embeddings is 30%; hidden layers are dropped out by 5%.

Table 3 compares our models to various other methods. The first entry
presents the previous state-of-the-art result, achieved by traditional
feature/rule engineering (Silva et al., 2011). Their method utilizes
more than 10k features and 60 hand-coded rules. On the contrary, our
TBCNN models do not use a single human-engineered feature or rule.
Despite this, c-TBCNN achieves similar accuracy compared with feature
engineering; d-TBCNN pushes the state-of-the-art result to 96%. To the
best of our knowledge, this is the first time that neural networks beat
dedicated human engineering in this question classification task.

The result also shows that both c-TBCNN and d-TBCNN reduce the error
rate to a large extent, compared with other neural architectures in
this task.

Method                            Acc. (%)  Reported in
SVM (10k features + 60 rules)     95.0      Silva et al. (2011)
CNN-non-static                    93.6      Kim (2014)
CNN-multichannel                  92.2      Kim (2014)
RNN                               90.2      Zhao et al. (2015)
Deep-CNN                          93.0      Kalchbrenner et al. (2014)
Ada-CNN                           92.4      Zhao et al. (2015)
c-TBCNN                           94.8      Our implementation
d-TBCNN                           96.0      Our implementation

Table 3: Accuracy of 6-way question classification.

4.3 Model Analysis

In this part, we analyze our models quantitatively and qualitatively in
several aspects, shedding some light on the mechanism of TBCNNs.

4.3.1 The Effect of Pooling

The extracted features by tree-based convolution have topologies
varying in size and shape. We propose in Section 3.3 several heuristics
for pooling. This subsection aims to provide a fair comparison among
these pooling methods.

One reasonable protocol for comparison is to tune all hyperparameters
for each setting and compare the highest accuracy. This methodology,
however, is too time-consuming, and depends largely on the quality of
hyperparameter tuning. An alternative is to predefine a set of sensible
hyperparameters and report the accuracy under the same setting. In this
experiment, we chose the latter protocol, where hidden layers are all
300-dimensional; no $\ell_2$ penalty is added. Each configuration was
run five times with different random initializations. We summarize the
mean and standard deviation in Table 4.

As the results imply, complicated pooling is better than global pooling
to some degree for both model variants. But the effect is not strong;
our models are not that sensitive to pooling methods, which mainly
serve as a necessity for dealing with varying-structure data. In our
experiments, we apply 3-slot pooling for c-TBCNN and 2-slot pooling for
d-TBCNN.

Comparing with other studies in the literature, we also notice that
pooling is very effective and efficient in information gathering. Irsoy
and Cardie (2014) report 200 epochs for training a deep RNN, which
achieves 49.8% accuracy in the 5-class sentiment classification. Our
TBCNNs are typically trained within 25 epochs.

Model      Pooling method  5-class accuracy (%)
c-TBCNN    Global          48.48 ± 0.54
c-TBCNN    3-slot          48.69 ± 0.40
d-TBCNN    2-slot          49.94 ± 0.63

Table 4: Accuracies of different pooling methods, averaged over 5
random initializations. We chose sensible hyperparameters manually in
advance to make a fair comparison. This leads to performance
degradation (1-2%) vis-a-vis Table 2.

Figure 4: Accuracies versus sentence lengths.


4.3.2 The Effect of Sentence Lengths 

We analyze how sentence lengths affect our models. Sentences are split
into 7 groups by length, with granularity 5. A few too long or too
short sentences are grouped together for smoothing; the numbers of
sentences in each group vary from 126 to 457. Figure 4 presents
accuracies versus lengths in TBCNNs. For comparison, we also
reimplemented RNN, achieving 42.7% overall accuracy, slightly worse
than 43.2% reported in Socher et al. (2011b). Thus, we think our
reimplementation is fair and that the comparison is sensible.

We observe that c-TBCNN and d-TBCNN yield very similar behaviors. They
consistently outperform the RNN in all scenarios. We also notice the
gap, between TBCNNs and RNN, increases when sentences contain more than
20 words. This result confirms our theoretical analysis in Section 2:
for long sentences, the propagation paths in RNNs are deep, causing
RNNs’ difficulty in information processing. By contrast, our models
explore structural information more effectively with tree-based
convolution. As information from any part of the tree can propagate to
the output layer with short paths, TBCNNs are more capable for sentence
modeling, especially for long sentences.

Figure 5: Visualizing how features (after convolution) are related to
the sentiment of a sentence. The sample corresponds to a sentence in
the dataset, “The stunning dreamlike visual will impress even those
viewers who have little patience for Euro-film pretension.” The numbers
in brackets denote the fraction of a node’s features that are gathered
by the max pooling layer (also indicated by colors).

4.3.3 Visualization 

Visualization is important to understanding the mechanism of neural
networks. For TBCNNs, we would like to see how the extracted features
(after convolution) are further processed by the max pooling layer, and
ultimately related to the supervised task.

To show this, we trace back where the max pooling layer’s features come
from. For each dimension, the pooling layer chooses the maximum value
from the nodes that are pooled to it. Thus, we can count the fraction
in which a node’s features are gathered by pooling. Intuitively, if a
node’s features are more related to the task, the fraction tends to be
larger, and vice versa.
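The tracing procedure above can be sketched for a single global pooling slot. This is a minimal sketch under assumptions: the function name and the dictionary input are hypothetical, and ties between nodes are broken in favor of the node listed first, a detail the paper does not specify.

```python
def pooled_fractions(features):
    """Fraction of dimensions in which each node supplies the pooled maximum.

    features: dict mapping a node name to its feature vector after
    convolution; all vectors share the same length. Ties go to the node
    that appears first in the dict (an assumption of this sketch).
    """
    names = list(features)
    n_dims = len(features[names[0]])
    counts = {name: 0 for name in names}
    for d in range(n_dims):
        winner = max(names, key=lambda name: features[name][d])
        counts[winner] += 1
    return {name: counts[name] / n_dims for name in names}


# Toy usage with made-up feature values for two nodes.
fractions = pooled_fractions({
    "visual": [0.9, 0.1, 0.8, 0.7],
    "the": [0.0, 0.2, 0.1, 0.0],
})
```

The fractions over all nodes sum to 1, so they can be read as each node's share of the pooled representation.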

Figure 5 illustrates an example processed by d-TBCNN in the task of
sentiment analysis.⁶ Here, we applied global pooling because
information tracing is more sensible with one pooling slot. As shown in
the figure, tree-based convolution can effectively extract information
relevant to the task of interest. The 2-layer windows corresponding to
“visual will impress viewers” and “the stunning dreamlike visual,” say,
are discriminative to the sentence’s sentiment. Hence, large fractions
(0.24 and 0.19) of their features, after convolution, are gathered by
pooling. On the other hand, words like the, will, even are known as
stop words (Fox, 1989). They are mostly noninformative for sentiment;
hence, no (or minimal) features are gathered. Such results are
consistent with human intuition.

We further observe that tree-based convolution does integrate
information of different words in the window. For example, the word
stunning appears in two windows: (a) the window “stunning” itself, and
(b) the window of “the stunning dreamlike visual,” with root node
visual, stunning acting as a child. We see that Window b is more
relevant to the ultimate sentiment than Window a, with fractions 0.19
versus 0.07, even though the root visual itself is neutral in
sentiment. In fact, Window b has a larger fraction than the sum of its
children’s (the windows of “the,” “stunning,” and “dreamlike”).

⁶ We only have space to present one example in the paper. This example
was not chosen deliberately. Similar traits can be found throughout the
entire gallery, available on our website. Also, we only present
d-TBCNN, noticing that dependency trees are intrinsically more suitable
for visualization since we know the “meaning” of every node.


5 Conclusion

In this paper, we proposed a novel neural discriminative sentence model
based on sentence parsing structures. Our model can be built upon
either constituency trees (denoted as c-TBCNN) or dependency trees
(d-TBCNN).

Both variants have achieved high performance in sentiment analysis and
question classification. d-TBCNN is slightly better than c-TBCNN in our
experiments, and has outperformed previous state-of-the-art results in
both tasks. The results show that tree-based convolution can capture
sentences’ structural information effectively, which is useful for
sentence modeling.